We have data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They performed barbell lifts correctly and incorrectly in 5 different ways. Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. The training dataset is taken from here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
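The analysis below assumes the training data has already been read into `dfPMLTraining` and that the caret package is loaded. A minimal loading sketch (blank cells are deliberately kept as `""` rather than converted to NA, since the filtering step later checks for both):

```r
# Load caret, which provides createDataPartition, train, and confusionMatrix
library(caret)

# Read the training data directly from the course URL
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
dfPMLTraining <- read.csv(urlTrain)
```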
Our goal is to build a model and predict the class of each test exercise.
The training dataset contains many variables, which makes it difficult to build a model:
dim(dfPMLTraining)
## [1] 19622 160
Let's identify and remove some variables that are not useful as predictors.
We can see that the participants performed the exercises sequentially, one after another, and every participant performed exercises of all classes (A, B, C, D, E). The variable X is just the row number, and the user names, timestamps, and window variables describe the recording session rather than the movement itself, so they should not be used as predictors.
dfPMLTrainingEx <- subset(dfPMLTraining,
                          select = c(-X, -user_name,
                                     -raw_timestamp_part_1,
                                     -raw_timestamp_part_2,
                                     -cvtd_timestamp,
                                     -new_window, -num_window))
Some variables in the dataset consist mostly of NA or blank (equal to "") values.
vNotEmptyColumns <- sapply(dfPMLTrainingEx,
                           function(x) {
                             sum(is.na(x) | x == "") < 0.9 * nrow(dfPMLTrainingEx)
                           })
table(vNotEmptyColumns)
## vNotEmptyColumns
## FALSE TRUE
## 100 53
We can remove these 100 variables too.
dfPMLTrainingEx <- dfPMLTrainingEx[,vNotEmptyColumns]
dim(dfPMLTrainingEx)
## [1] 19622 53
Now we have 53 variables for building the model instead of the 160 in the original dataset.
Let's split the dataset (75% for training and 25% for validation), fit several models on the training partition, and test them on the validation partition.
inTraining <- createDataPartition(y = dfPMLTrainingEx$classe, p = 0.75, list = FALSE)
dfPMLTrainingExTr <- dfPMLTrainingEx[inTraining,]
dfPMLTrainingExTst <- dfPMLTrainingEx[-inTraining,]
rm(dfPMLTraining, dfPMLTrainingEx)
gc()
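By default, caret's train() estimates tuning performance with bootstrap resampling, which is slow for a random forest on roughly 14,700 training rows. An optional speed-up is to pass an explicit k-fold cross-validation control; this is a sketch only, and the accuracies reported below are assumed to come from the default settings:

```r
# Optional: 5-fold cross-validation instead of the default bootstrap
# (an assumption for speed; not used for the reported results)
fitControl <- trainControl(method = "cv", number = 5)

# It would then be passed to each model, e.g.:
# fitRF <- train(classe ~ ., method = "rf",
#                data = dfPMLTrainingExTr, trControl = fitControl)
```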
fitRF <- train(classe ~ ., method = "rf", data = dfPMLTrainingExTr)
predRF <- predict(fitRF, dfPMLTrainingExTst)
confMatrRF <- confusionMatrix(predRF, dfPMLTrainingExTst$classe)
strLabelRF <- fitRF$modelInfo$label
strAccRF <- confMatrRF$overall[1]
rm(predRF, confMatrRF)
gc()
fitTR <- train(classe ~ ., method = "rpart", data = dfPMLTrainingExTr)
predTR <- predict(fitTR, dfPMLTrainingExTst)
confMatrTR <- confusionMatrix(predTR, dfPMLTrainingExTst$classe)
strLabelTR <- fitTR$modelInfo$label
strAccTR <- confMatrTR$overall[1]
rm(fitTR, predTR, confMatrTR)
gc()
fitBS <- train(classe ~ ., method = "gbm", data = dfPMLTrainingExTr, verbose = FALSE)
predBS <- predict(fitBS, dfPMLTrainingExTst)
confMatrBS <- confusionMatrix(predBS, dfPMLTrainingExTst$classe)
strLabelBS <- fitBS$modelInfo$label
strAccBS <- confMatrBS$overall[1]
rm(fitBS, predBS, confMatrBS)
gc()
fitLDA <- train(classe ~ ., method = "lda", data = dfPMLTrainingExTr)
predLDA <- predict(fitLDA, dfPMLTrainingExTst)
confMatrLDA <- confusionMatrix(predLDA, dfPMLTrainingExTst$classe)
strLabelLDA <- fitLDA$modelInfo$label
strAccLDA <- confMatrLDA$overall[1]
rm(fitLDA, predLDA, confMatrLDA)
gc()
Now let's compare the accuracy of the predictions.
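The comparison table can be assembled from the labels and accuracies saved above; the original code for this step is not shown, but something like the following would produce it:

```r
# Collect the model labels and validation-set accuracies computed earlier
dfAccuracy <- data.frame(
  MethodName = c(strLabelRF, strLabelTR, strLabelBS, strLabelLDA),
  Accuracy   = c(strAccRF, strAccTR, strAccBS, strAccLDA),
  row.names  = NULL)
dfAccuracy
```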
## MethodName Accuracy
## 1 Random Forest 0.9951060
## 2 CART 0.4951060
## 3 Stochastic Gradient Boosting 0.9692088
## 4 Linear Discriminant Analysis 0.7014682
The Random Forest method gives the best result.
Now we can predict the classes of the 20 test measurements.
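The test set `dfPMLTesting` is assumed to have been loaded the same way as the training data; the URL below follows the same pattern as the training file and is an assumption:

```r
# Read the 20-row test set (URL assumed by analogy with the training file)
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
dfPMLTesting <- read.csv(urlTest)
```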
predRFfinal <- predict(fitRF, dfPMLTesting)
predRFfinal
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E